perf(db): bulk CFG and dataflow DB writes via rusqlite#653
Conversation
Move AST node SQLite inserts from per-row JS iteration to a single native Rust transaction via napi-rs + rusqlite. The new bulkInsertAstNodes function opens the DB directly from Rust, pre-fetches parent node definitions, and inserts all rows in one transaction — eliminating the JS-native FFI overhead per row. The JS-side buildAstNodes tries the native fast path first (when all files have native astNodes arrays), falling back to the existing JS loop for WASM or mixed-engine scenarios. Target: astMs < 50ms on native full builds (was ~393ms).
The Rust connection omitted the busy_timeout = 5000 that the JS-side connection.ts sets. Without it, SQLITE_BUSY is returned immediately on WAL contention instead of retrying for up to 5 seconds.
Please document why the bundled SQLite feature is intentional: Windows CI lacks a system SQLite, and WAL coordination between the two instances (Rust and JS) is handled by OS-level file locking, so running dual instances is safe.
The Rust find_parent_id skipped definitions with end_line = NULL, but the JS findParentDef treats them as always-enclosing with a negative span (preferred over wider defs). This caused parent_node_id mismatches between native and JS paths.
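The JS-side preference rule described above can be sketched roughly as follows — a hedged TypeScript illustration, assuming a minimal `Def` shape (the `startLine`/`endLine` field names and this standalone `findParentDef` signature are illustrative, not the actual source):

```typescript
// Illustrative sketch of the parent-resolution rule: a def with a NULL
// end_line is treated as always-enclosing, and its computed span comes
// out negative, so it is preferred over any wider, fully-bounded def.
interface Def {
  id: number;
  startLine: number;
  endLine: number | null; // NULL in the DB maps to null here
}

function findParentDef(defs: Def[], nodeLine: number): Def | undefined {
  let best: Def | undefined;
  let bestSpan = Infinity;
  for (const def of defs) {
    // null endLine => always-enclosing from its start line onward
    const encloses =
      def.endLine === null
        ? def.startLine <= nodeLine
        : def.startLine <= nodeLine && nodeLine <= def.endLine;
    if (!encloses) continue;
    // (null ?? 0) - startLine yields a negative span, which always
    // sorts before any real (non-negative) span — hence "preferred"
    const span = (def.endLine ?? 0) - def.startLine;
    if (span < bestSpan) {
      bestSpan = span;
      best = def;
    }
  }
  return best;
}
```

A Rust `find_parent_id` that simply skips `end_line = NULL` rows diverges from this rule whenever such a def exists, which is the mismatch the comment describes.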
Return 0 immediately on any insert_stmt.execute() failure so the transaction drops and rolls back, ensuring all-or-nothing semantics. Previously, .is_ok() silently swallowed row-level errors, which could commit partial data and mis-trigger the JS fallback, causing duplicate rows.
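The all-or-nothing semantics can be illustrated with a pure TypeScript analog (the real mechanism is rusqlite's transaction being dropped without commit; this staged-array version and its names are hypothetical):

```typescript
// TS analog of the fix above: rows are staged and only "committed"
// (pushed into the table) if every insert succeeds. Any row-level
// failure aborts the whole batch, mirroring the Rust transaction
// rolling back on early return.
function insertAllOrNothing<T>(
  table: T[],
  rows: T[],
  insert: (row: T) => boolean, // false models a row-level execute() error
): number {
  const staged: T[] = [];
  for (const row of rows) {
    if (!insert(row)) return 0; // abort: nothing from this batch commits
    staged.push(row);
  }
  table.push(...staged); // commit only after every row succeeded
  return staged.length;
}
```

Returning 0 on failure also gives the JS caller an unambiguous signal that the batch did not land, so the fallback path cannot produce duplicates from a partially committed batch.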
Move CFG block/edge and dataflow edge inserts from JS iteration to Rust bulk operations, following the same pattern as bulk_insert_ast_nodes (6.9).

Rust side:
- cfg_db.rs: bulk_insert_cfg() resolves function node IDs, deletes stale data, inserts blocks+edges in a single rusqlite transaction
- dataflow_db.rs: bulk_insert_dataflow() pre-builds a node resolution cache (local-first, global fallback), inserts edges in a single transaction

JS side:
- cfg.ts: native fast path collects CfgFunctionBatch[] and delegates to Rust when all CFG is pre-computed by the native engine
- dataflow.ts: native fast path converts DataflowResult (argFlows, assignments, mutations) into FileDataflowBatch[] for Rust insertion
- Both fall back to the existing JS paths when the native addon is unavailable

Target: cfgMs + dataflowMs < 50ms combined (from ~286ms with JS iteration)
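The "local-first, global fallback" node resolution cache described above can be sketched roughly like this — a hedged TypeScript illustration (the actual implementation lives in Rust in dataflow_db.rs; the function name and Map-based shape here are assumptions):

```typescript
// Illustrative sketch: resolve each symbol to a node ID, preferring a
// definition in the current file (local) and falling back to a
// project-wide (global) definition. Unresolvable symbols are simply
// left out of the cache, so later edge inserts can skip them.
function buildResolutionCache(
  localDefs: Map<string, number>,  // symbol -> node id in current file
  globalDefs: Map<string, number>, // symbol -> node id project-wide
  symbols: string[],
): Map<string, number> {
  const cache = new Map<string, number>();
  for (const sym of symbols) {
    const id = localDefs.get(sym) ?? globalDefs.get(sym);
    if (id !== undefined) cache.set(sym, id); // unresolvable: stays absent
  }
  return cache;
}
```

Pre-building the cache once per file batch is what lets the insert loop run inside a single transaction without per-edge lookup queries.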
Greptile Summary

This PR extends the bulk-insert Rust optimization pattern (established in 6.9 for AST nodes) to the CFG and dataflow DB write phases, targeting a combined cfgMs + dataflowMs < 50ms.

Confidence Score: 3/5

Not safe to merge as-is: the dataflow bulk-insert path can create duplicate DB rows in normal usage, corrupting query results. Prior P1 feedback on fallback logic is well-addressed for both cfg.ts and dataflow.ts, and the CFG implementation is correct and idempotent. However, a new concrete P1 remains: dataflow_db.rs commits partial results without a preceding DELETE, and the JS fallback re-inserts the same resolved edges into an unconstrained table. This is not hypothetical — any project with calls to external/unresolvable functions will trigger it. A focused one-function fix (add a DELETE-before-insert in bulk_insert_dataflow, matching the build-stmts.ts pattern) would bring this to a 4 or 5. See crates/codegraph-core/src/dataflow_db.rs and src/features/dataflow.ts — the duplicate-on-fallback path.
Sequence Diagram

```mermaid
sequenceDiagram
    participant JS as buildDataflowEdges (TS)
    participant Rust as bulk_insert_dataflow (Rust)
    participant SQLite as SQLite DB
    JS->>JS: Collect file batches from symbols.dataflow
    JS->>Rust: bulkInsertDataflow(db.name, batches)
    Rust->>SQLite: Open connection (READ_WRITE)
    Rust->>SQLite: Phase 1 — resolve node IDs into cache
    Rust->>SQLite: BEGIN TRANSACTION
    loop For each edge
        alt Node resolved
            Rust->>SQLite: INSERT INTO dataflow (...)
        else Node unresolvable
            Rust->>Rust: continue (skip, no delete)
        end
    end
    Rust->>SQLite: COMMIT
    Rust-->>JS: inserted count
    alt inserted === expectedEdges OR expectedEdges === 0
        JS->>JS: return (success)
    else inserted < expectedEdges (unresolvable nodes)
        note over SQLite: Partial edges already committed
        JS->>SQLite: JS fallback — INSERT same resolvable edges again
        note over SQLite: Duplicates created, no UNIQUE constraint
    end
```
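The duplicate-on-fallback failure mode in the diagram can be reproduced with a small TypeScript simulation (the table is modeled as a plain array to mirror the unconstrained dataflow table; all names here are illustrative, not the actual source):

```typescript
// Simulation of the P1: the native path commits resolvable edges but
// skips unresolvable ones (no preceding DELETE, no rollback), then the
// count-mismatch check triggers the JS fallback, which re-inserts the
// same resolvable edges — nothing de-duplicates them.
interface Edge { from: string; to: string; resolvable: boolean }

function nativeBulkInsert(table: Edge[], edges: Edge[]): number {
  let inserted = 0;
  for (const e of edges) {
    if (!e.resolvable) continue; // skipped — but earlier rows stay committed
    table.push(e);
    inserted++;
  }
  return inserted;
}

function jsFallbackInsert(table: Edge[], edges: Edge[]): void {
  for (const e of edges) {
    if (e.resolvable) table.push(e); // no UNIQUE constraint stops this
  }
}

function buildDataflow(table: Edge[], edges: Edge[]): void {
  const inserted = nativeBulkInsert(table, edges);
  if (inserted === edges.length || edges.length === 0) return;
  // Mismatch: fall back — but the partial native insert is still there
  jsFallbackInsert(table, edges);
}
```

A DELETE-before-insert at the start of the native bulk operation (or a DELETE before the JS fallback) would make the operation idempotent and break this chain.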
Reviews (2): Last reviewed commit: "fix(db): add JS fallback on bulk-insert ..."
src/features/cfg.ts
```ts
if (!needsJsFallback) {
  const processed = native.bulkInsertCfg(db.name, batches);
  const withBlocks = batches.filter((b) => b.blocks.length > 0).length;
  if (processed > 0) {
    info(`CFG: ${withBlocks} functions analyzed (native bulk)`);
  }
  return;
}
```
Silent data loss on bulk-insert failure — no JS fallback
bulk_insert_cfg returning 0 indicates a Rust-side error (DB connection failure, missing table, failed transaction, etc.). The current code unconditionally returns after calling it, meaning any such failure silently drops all CFG writes without attempting the JS fallback path. This diverges from the established pattern in ast.ts, which explicitly checks the count and falls through on failure.
Compare ast.ts:

```ts
if (inserted === expectedNodes) {
  debug(`AST extraction (native bulk): ${inserted} nodes stored`);
  return;
}
debug(`AST extraction (native bulk): expected ${expectedNodes}, got ${inserted} — falling back to JS`);
// fall through to JS path
```

The CFG path should follow the same pattern — check the return value and only return on success:
```ts
if (!needsJsFallback) {
  const processed = native.bulkInsertCfg(db.name, batches);
  const withBlocks = batches.filter((b) => b.blocks.length > 0).length;
  if (processed > 0 || withBlocks === 0) {
    if (processed > 0) info(`CFG: ${withBlocks} functions analyzed (native bulk)`);
    return;
  }
  debug(`CFG: bulk insert returned 0 — falling back to JS`);
  // fall through
}
```
Fixed — cfg.ts now checks processed === batches.length and falls through to the JS path on mismatch, matching the ast.ts reference pattern. Also handles the empty-blocks edge case by checking expectedFunctions === 0.
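The fixed check described in this reply can be sketched as a standalone helper — a hedged TypeScript illustration (this `tryNativeBulkCfg` wrapper and its parameter shapes are hypothetical; the real code lives inline in cfg.ts):

```typescript
// Sketch of the corrected fast path: only return early when the native
// bulk insert processed every batch, or there was nothing to insert;
// any mismatch signals a Rust-side failure and the caller falls back
// to the JS insert loop instead of silently dropping writes.
interface CfgFunctionBatch { blocks: unknown[] }

function tryNativeBulkCfg(
  native: { bulkInsertCfg(dbName: string, batches: CfgFunctionBatch[]): number },
  dbName: string,
  batches: CfgFunctionBatch[],
): boolean {
  const processed = native.bulkInsertCfg(dbName, batches);
  const expectedFunctions = batches.length;
  if (processed === expectedFunctions || expectedFunctions === 0) {
    return true; // success — caller skips the JS path
  }
  return false; // mismatch — caller runs the JS fallback
}
```

Comparing against `batches.length` rather than `> 0` is what catches partial native failures, not just total ones.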
```ts
if (!needsJsFallback) {
  const inserted = native.bulkInsertDataflow(db.name, batches);
  info(`Dataflow: ${inserted} edges inserted (native bulk)`);
  return;
}
```
Silent data loss on bulk-insert failure — no JS fallback
Same issue as in cfg.ts: bulk_insert_dataflow returning 0 silently drops all dataflow writes without attempting the JS fallback. If all batches have edges but Rust returns 0, no edges are written and no fallback is attempted while the caller observes a clean return.
The ast.ts pattern protects against this by checking inserted === expectedNodes. The dataflow path should do the same:
```ts
if (!needsJsFallback) {
  const inserted = native.bulkInsertDataflow(db.name, batches);
  const expectedEdges = batches.reduce((s, b) => s + b.edges.length, 0);
  if (inserted === expectedEdges || expectedEdges === 0) {
    info(`Dataflow: ${inserted} edges inserted (native bulk)`);
    return;
  }
  debug(
    `Dataflow: bulk insert expected ${expectedEdges} edges, got ${inserted} — falling back to JS`,
  );
  // fall through to JS path
}
```
Fixed — dataflow.ts now checks inserted === expectedEdges and falls through to the JS path on mismatch, matching the ast.ts reference pattern.
Match the ast.ts reference pattern: check the Rust return value against the expected count and fall through to the JS path when they disagree, preventing silent data loss on Rust-side errors.
Addressed all Greptile feedback in 1a20f46:
All 11 CFG and 23 dataflow integration tests pass.
Summary
- bulk_insert_cfg() and bulk_insert_dataflow() Rust napi functions that perform CFG block/edge and dataflow edge inserts via rusqlite in single transactions, bypassing JS iteration loops
- buildCFGData() and buildDataflowEdges() fast paths that collect batches from pre-computed native extraction results and delegate to Rust bulk inserters

Details
Same optimization pattern as bulk_insert_ast_nodes (6.9). The DB write phases for CFG (161ms) and dataflow (125ms) were identical JS code on both engines — this moves them to Rust when the native engine is active.

Rust side:
- cfg_db.rs — resolves function node IDs, deletes stale data, inserts blocks+edges with block index→row ID mapping
- dataflow_db.rs — pre-builds a node resolution cache (local-first, global fallback), inserts flows_to/returns/mutates edges

Target: cfgMs + dataflowMs < 50ms combined on native full builds (current ~286ms)

Test plan